Network analysis for estimating standardization trends in genomics using MEDLINE

Background Biotechnology in genomics, such as sequencing devices and gene quantification software, has proliferated and been applied in clinical settings. However, the lack of applicable standards poses practical problems for the interoperability and reusability of the technology across application domains. This study aims to identify and visualize standardization trends in clinical genomics and to suggest areas on which standardization efforts should focus. Methods Of 16,538 articles retrieved from PubMed, published from 1975 to 2020, using the search keywords “genomics and standard” and “clinical genomic sequence and standard”, terms were extracted from the abstracts and titles of 15,855 articles. Our analysis includes (1) network analysis of the full phase; (2) period analysis with five phases; (3) statistical analysis; and (4) content analysis. Results Research output showed an increasing trend from 2003, the year marked by the completion of the Human Genome Project. The content analysis showed that keywords related to gene types for analysis and to analysis techniques increased in phase 3, when the US-FDA first approved a next-generation sequencer. During 2017–2019, oncology-relevant terms were clustered and contributed to the increasing trend in phase 4 of the content analysis. In the statistical analysis, all categories showed high regression values (R2 > 0.586) over the whole analysis period, and the phase-based statistical analysis showed significance only in the Genetics terminology category (P = .039*) at phase 4. Conclusions Through the comprehensive trend analysis in this study, we identified trend shifts and high-demand items in standardization for clinical genetics. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01740-4.


Introduction
The dawn of the 20th century saw the rise of medical genetics research on humans due to the discovery of Mendelian inheritance disorders [1,2]. Remarkable progress in medical genetics was made in the latter part of the 20th century, notably in cancer genetics [3]. In particular, research on disease diagnosis using genomic sequencing technologies has gained momentum, thanks to the wide availability of next-generation sequencing (NGS) methods. To use this advanced genetic analysis technology in medical institutions or clinical settings, it is essential to develop standard procedures that can be commonly used. Various standard guidelines are being developed by industry and international standards development organizations for clinical examination and diagnosis of diseases, such as cancer, leukemia, and tuberculosis [4][5][6][7]. These standards, from such organizations as the American College of Medical Genetics (ACMG), the Association of Molecular Pathology, and the Microarray Quality Control Consortium [4][7][8][9], have enabled the active use of various sequencing technologies and methods in the clinic [7].
However, existing standards and their coverage cannot be claimed to be sufficient to meet the standardization demand from the market, which is notably evident in the clinical applications of NGS to disease diagnosis [4][5][6][7]. For newly developed genetic technology to thrive, standards research must be able to scan the clinical environment of genomics and the recent status and trends of analysis technology, and to gather the technical resources necessary for standardization, in order to refine priorities for genomics standardization.
One way to scan the genomics environment is to apply network analysis to research articles to reveal environmental changes that can be used as guidance for standardization. To explore specific research trends, network analysis using bibliometric data has been widely applied to various research domains, for example, genomics [10], public health [11,12], and medicine [13]. Network analysis in this study is used to reveal changes in research trends. Identifying such trend changes can enable standards development research to construct a strategy that meets the standardization demand from genetic research and clinical practice. In detail, this study uses network analysis (1) to suggest recent genomics trends and narrow the range of topics to keywords showing a strong relation to standardization, and (2) to examine temporal trends and the related critical developments that drive changes in trend. Through this study, we intend to derive the developments that act as major factors and indicators that standards development should consider.

Study flow
The overall study procedure is shown in [Supplementary file, Figure S1] and summarized as follows: (1) search articles with two Medical Subject Heading (MeSH) terms ("genomics and standard" and "clinical genomic sequence and standard") in PubMed; (2) export PMID numbers; (3) extract keywords from the abstracts and titles of the articles; (4) keyword preparation; (5) development of the network analysis with the keyword frequency matrix; (6) development of the period analysis with the keyword frequency matrix; and (7) categorization of keywords for statistical analysis.

Data source
The MEDLINE database is provided by the US National Library of Medicine and contains various types of scientific literature in biomedical and life science fields [14]. We used PubMed, which provides free access to the MEDLINE database and links to the abstracts. To explore research trends of standardization in genomics, we searched two MeSH terms, "genomics and standard" and "clinical genomic sequence and standard", for articles published between 1975 and September 2020. The search returned 16,550 articles of various types, such as reviews, original articles, and perspectives. Of these, 10,000 articles were indexed with the search term "Genomics and standard", and 6,550 articles with "Clinical genomic sequence and standard". Of the 16,550 articles, we used the 15,855 articles whose abstracts and titles were accessible and written in English.

Keyword preparation
The data preparation is summarized in [Supplementary file, Figure S1]. A total of 5,639 keywords with a combined frequency of 36,275 were extracted from the abstracts and titles of the 15,855 articles using the TextRank algorithm [15] with Corpus 16,000. TextRank is commonly used to extract single terms from literature, so we used it to extract semantic keywords. Four experts manually screened and reviewed the keywords following a set of exclusion criteria based on previous studies. The exclusion criteria were: 1) non-technical terms, under three conditions: (a) everyday terms, such as "she" and "others"; (b) terms that are not related to specialized scientific and technical knowledge, such as "abc", "scientist", "concept", and "consensus"; and (c) adjectives and adverbs, such as "happy", "firstly", "lastly", and "furthermore"; 2) temporal terms, such as months and weekdays, as well as other temporal terms that do not specify a precise point in time or period, such as "April" without a year (instead of "April 2004") or "Monday" without a year and month; and 3) compound nouns that meet two conditions: (a) the constituent terms of the compound noun have already been counted individually, such as "genomics proteomics" and "protein gene", and (b) the compound noun does not constitute a meaningful term, such as "furthermore genes" and "statistically disease". After the manual cleansing, 1,024 keywords were left.
For synonyms with different spellings or different capitalization, we merged the terms into a single capitalized abbreviation instead of the spelled-out term. All plural terms were converted to, and merged with, their singular forms.
Because many duplicated compound nouns, such as "HBV HBC", "proteomics proteomics", and "CpG CpG", and meaningless compound nouns with more than three words, such as "genorm bestkeeper normfinder" and "genetics genomics acmg", were automatically generated with frequencies under 12, we set a further exclusion criterion for keywords with a frequency of less than 12. Removing keywords under this criterion deleted most of the unhelpful compound nouns and resulted in 330 keywords with a total frequency of N = 16,213.
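The frequency cut-off described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `apply_frequency_cutoff` and the keyword counts are hypothetical, and only the threshold of 12 comes from the text.

```python
from collections import Counter

MIN_FREQUENCY = 12  # cut-off stated in the text


def apply_frequency_cutoff(frequencies, min_frequency=MIN_FREQUENCY):
    """Keep only keywords whose total frequency reaches the cut-off."""
    return Counter(
        {kw: n for kw, n in frequencies.items() if n >= min_frequency}
    )


# Hypothetical counts, for illustration only (not data from the study).
raw_counts = Counter({
    "genome": 120,
    "qPCR": 12,
    "CpG CpG": 5,
    "proteomics proteomics": 7,
    "genorm bestkeeper normfinder": 3,
})

kept = apply_frequency_cutoff(raw_counts)
total_frequency = sum(kept.values())
print(sorted(kept), total_frequency)
```

In this toy input, the low-frequency compound nouns fall below the cut-off and are dropped, which mirrors the effect reported for the real keyword list.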

Network analysis
The overall network analysis was performed following previous studies [16,17]. In network analysis of research articles, a higher keyword frequency indicates a larger number of relevant studies in a particular year. The weighted Jaccard similarity between two keywords is commonly used to evaluate their closeness. A network consists of many nodes and edges. A node represents a keyword, and an edge represents the relatedness between two keywords.
The weighted Jaccard similarity provides an edge weight from 0 to 1. For example, if the edge weight is 1, the two keywords have identical frequencies in every publication year. In this study, we calculated each edge, the relatedness between two keywords, by the weighted Jaccard similarity using the frequencies of the keywords [16,17]. For the network analysis, we used keyword frequency data from the full phase. The weight of a node in the network was determined by the PageRank algorithm [18], and a community detection algorithm [19] was used to cluster keywords. When PageRank calculates node sizes, it considers edge weights. In this study, PageRank and the community detection algorithm based on modularity optimization were run in Gephi 0.8.2. The node size was set by the PageRank score, and the color of an edge was set by the modularity value. Using the derived values, the network model of the relationships between keywords was visualized in Gephi.
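To make the role of edge weights in PageRank concrete, the following is a minimal pure-Python sketch of weighted PageRank by power iteration on an undirected keyword graph. It is not Gephi's implementation: the function name, the damping factor of 0.85, and the toy edge weights are assumptions chosen for illustration.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on an undirected, weighted graph.

    edges: dict mapping (keyword_a, keyword_b) -> positive similarity weight.
    Returns a dict mapping each keyword to its PageRank score.
    """
    neighbours = {}
    for (u, v), w in edges.items():
        if w <= 0:
            continue  # zero-weight pairs are treated as not connected
        neighbours.setdefault(u, {})[v] = w
        neighbours.setdefault(v, {})[u] = w

    nodes = sorted(neighbours)
    n = len(nodes)
    # Total edge weight attached to each node, used to split its score.
    strength = {u: sum(neighbours[u].values()) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(iterations):
        new_rank = {}
        for u in nodes:
            incoming = sum(
                rank[v] * w / strength[v] for v, w in neighbours[u].items()
            )
            new_rank[u] = (1.0 - damping) / n + damping * incoming
        change = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical similarity weights between four keywords (illustration only).
toy_edges = {
    ("gene", "genome"): 0.8,
    ("gene", "DNA"): 0.6,
    ("genome", "DNA"): 0.1,
    ("gene", "allele"): 0.5,
}
scores = weighted_pagerank(toy_edges)
print(max(scores, key=scores.get))
```

A keyword connected to many others through strong edges collects the largest score, which corresponds to a larger node in the visualized network.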
The similarity between keywords and between publication years
The relatedness between keywords is represented by the similarity obtained via the weighted Jaccard similarity equation:
$$J(S, T) = \frac{\sum_{k} \min(S_k, T_k)}{\sum_{k} \max(S_k, T_k)}$$
First, a two-dimensional annual frequency matrix (Supplementary file 1, Figure S1) was generated with the frequency of each term by publication year: a matrix of 330 (the number of keywords) x 46 (the number of publication years, from 1975 to 2020). In the equation, for the network analysis, S and T represent two keywords, and k represents the ordinal position within the frequency data of S and T. Based on the matrix, we calculated the similarity value between two keywords using the frequency data in their rows. For example, when we calculate the similarity between the keywords "AAV (S)" and "Abi (T)", the frequency data for the keywords are S = {0, …, 1, 0} and T = {1, …, 1, 0}. Using these input data, we obtained the similarity value J (S, T) = (0 + … + 1 + 0)/(1 + … + 1 + 0). For the period analysis, we used the frequency data in the column of each publication year to calculate the similarity between publication years. For example, the similarity between 2019:2020 is calculated with the frequencies of 2019 (S) and 2020 (T): S = {1, 1, 0, 1, 8, 2, 4, 0, …} and T = {0, 0, 1, 1, 2, 0, 0, 0, …}. Taking the smaller value of each pair in the numerator and the larger value in the denominator, the similarity value between 2019:2020 is J (S, T) = (0 + 0 + 0 + 1 + 2 + 0 + 0 + 0 + …)/(1 + 1 + 1 + 1 + 8 + 2 + 4 + 0 + …). The maximum similarity value is 1.0, and as the similarity increases, two keywords in the network analysis or two publication years in the period analysis match more closely.
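As a worked illustration of the similarity calculation, the sketch below assumes the usual definition of the weighted Jaccard similarity: the sum of element-wise minima divided by the sum of element-wise maxima. The function name is hypothetical, and the two vectors are only the first eight entries quoted in the text for 2019 and 2020, so the result is not the similarity reported in the study.

```python
def weighted_jaccard(s, t):
    """Weighted Jaccard similarity of two equal-length frequency vectors."""
    if len(s) != len(t):
        raise ValueError("frequency vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(s, t))
    denominator = sum(max(a, b) for a, b in zip(s, t))
    # Two all-zero vectors share no information; report 0 by convention.
    return numerator / denominator if denominator else 0.0


# Truncated example vectors from the text (first eight keywords only).
s_2019 = [1, 1, 0, 1, 8, 2, 4, 0]
t_2020 = [0, 0, 1, 1, 2, 0, 0, 0]

similarity = weighted_jaccard(s_2019, t_2020)
print(round(similarity, 4))  # 3 / 18 for these eight entries
```

The same function serves both analyses: rows of the frequency matrix give keyword-to-keyword similarity, and columns give year-to-year similarity.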

Period analysis
To observe when the research trend changed, a similarity analysis was performed between years. Through period analysis, we identified change points where the similarity graph curved steeply. This aids in exploring the social events that affect research trends. We calculated the differences between the yearly similarities to identify the local minimum and local maximum points. The local minimum and maximum points [Supplementary file 2, green colored] were identified before and after the relatively larger difference values [red color in Supplementary file 2].
To locate the local minimum and maximum points more precisely, we performed three types of similarity analysis for the period analysis: 1) The similarity between two publication years (e.g.,  3) The similarity between two similarity values with a 2-year interval (e.g., the similarity between the similarity values of 2000:2002 and 2001:2003).
Please note that phase 0 (1975-1999) was not included in the analysis, due to the low-frequency values (frequency of 10 to 72).
We submit that a local minimum and maximum point in similarity provides an indicator that there has been a significant development or event that deserves the attention of standards development communities.
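The change-point idea can be sketched as follows: compute the differences between successive similarity values, then flag interior points that are lower or higher than both neighbours. This is a simplified illustration under stated assumptions; the function names and the similarity values are hypothetical and do not come from Supplementary file 2.

```python
def successive_differences(values):
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


def local_extrema(values):
    """Return (minima, maxima) as lists of indices of interior points."""
    minima, maxima = [], []
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(i)
        elif values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(i)
    return minima, maxima


# Hypothetical year-pair similarities (illustration only).
labels = ["2000:2001", "2001:2002", "2002:2003", "2003:2004", "2004:2005"]
similarities = [0.62, 0.58, 0.41, 0.66, 0.64]

diffs = successive_differences(similarities)
largest_jump = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
minima, maxima = local_extrema(similarities)

print("largest change between", labels[largest_jump], "and", labels[largest_jump + 1])
print("local minima:", [labels[i] for i in minima])
print("local maxima:", [labels[i] for i in maxima])
```

In this toy series, the largest difference sits between a local minimum and the local maximum that follows it, which is the pattern the period analysis uses to mark a candidate change point.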

Content analysis
Through content analysis, we reviewed terms following our previous research [16,17], and additionally, in this study, we classified keywords into related research areas. First, the 330 keywords were classified into academic categories, and further, the same 330 keywords were classified into other subcategories [Supplementary file 2, Content analysis sheet]: 1) The keywords were sorted into six academic categories: Biology, General, Genetics, Medicine, Proteomics, and Statistics. For example, "Escherichia", "animal", and "Arabidopsis" were sorted into Biology; "Illumina", "allele", and "rRNA" into Genetics; "precision", "therapy", and "diagnosis" into Medicine; "peptide", "omics", and "QconCAT" into Proteomics; and "Bayesian", "algorithm", and "Gaussian" into Statistics. The keywords in the General category can be used in other academic fields. For example, "database", "knowledge", and "measurement" could be used in any of Biology, Genetics, and Medicine, so such keywords were classified into the General category.
The full keyword category lists are provided in [Supplementary file 2, Content analysis sheet].
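The categorization step amounts to a lookup from keyword to category followed by a sum of frequencies per category. The sketch below uses a handful of the example keywords named in the text; the function name, the `Unclassified` fallback, and all frequency counts are assumptions made for illustration.

```python
from collections import defaultdict

# Keyword-to-category examples taken from the text (a small subset of the 330).
CATEGORY_OF = {
    "Escherichia": "Biology",
    "Arabidopsis": "Biology",
    "Illumina": "Genetics",
    "allele": "Genetics",
    "therapy": "Medicine",
    "diagnosis": "Medicine",
    "peptide": "Proteomics",
    "Bayesian": "Statistics",
    "algorithm": "Statistics",
    "database": "General",
}


def frequency_by_category(keyword_counts, category_of=CATEGORY_OF):
    """Sum keyword frequencies per academic category."""
    totals = defaultdict(int)
    for keyword, count in keyword_counts.items():
        # Keywords missing from the lookup are kept visible, not dropped.
        totals[category_of.get(keyword, "Unclassified")] += count
    return dict(totals)


# Hypothetical frequencies (not data from the study).
counts = {"Illumina": 30, "allele": 12, "Bayesian": 8, "database": 20, "xyz": 1}
totals = frequency_by_category(counts)
print(totals)
```

Running the same aggregation once per publication year or per phase yields the category-level frequency series used in the statistical analysis.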

Statistical analysis
To evaluate linear trends statistically, the generalized linear model has been commonly used in review and research articles [20,21]. In our study, a linear regression analysis was performed with keyword frequencies and publication years for each category to examine the relationship between phases. To derive phase-frequency data, the frequencies of the publication years within a phase were summed for each of the five phases, including phase 0. The academic categories and subcategories were treated as fixed factors, and the five phase-frequency lists were used as the dependent variables. Using these variables, we fitted a univariate generalized linear model (GLM) to statistically estimate the research trends of each phase.
For the GLM, we conducted parameter estimation for each of the 6 academic categories and 12 subcategories within each phase. IBM SPSS Statistics ver. 26 was used for the statistical analysis.
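For readers without SPSS, the simple linear regression of frequency on publication year and its R2 value can be sketched with ordinary least squares in plain Python. This covers only the regression step, not the univariate GLM with fixed factors; the function name and the yearly frequencies are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x, with R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical category frequencies per publication year (illustration only).
years = [2015, 2016, 2017, 2018, 2019]
frequencies = [10, 14, 13, 18, 20]

slope, intercept, r_squared = linear_fit(years, frequencies)
print(round(slope, 3), round(r_squared, 3))
```

A positive slope with a high R2 corresponds to the increasing linear relationship between keyword frequency and publication year that the study reports per category.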

The network analysis
The network is displayed in Fig. 1 with keywords derived from studies published from 1975 to 2020, using eight colors following a modularity of 0 to 7. According to the modularity value, full-phase keywords were clustered in different colors ( Fig. 1; Table 1).

Period analysis based on publication years
For period analysis, we selected three local minimum/maximum points using a large difference between the similarities of publication years [Supplementary file 2] to define the patterns of keyword appearance.  Table 2. From the periodic analysis above, we identified three main points at which the critical issues regarding standardization in genomics occurred.
Fig. 1 Network connectivity between keywords for the total period (1975-2020). In the network, the total frequency of 338 keywords is 16,213. Edges with the same color belong to the same cluster. Each keyword has one node, and it may have many edges to and from other keywords. The node size was determined by the PageRank score. Edges are displayed above a modularity threshold of 0.5.
We examined the trend of each term from phase 0 to phase 4 in subcategories as follows:  In [Supplementary file 1, Figure S2], "Escherichia" showed the highest frequency in phase 2, and "Mycobacterium" in phase 4. In Statistics, "Bayesian" and "algorithm" had their highest frequencies in phase 2, and the frequency of the latter steadily decreased until phase 4. The frequency of "Bayesian" increased from phase 3 to phase 4.
In the Company/Consortium graph, "Illumina" and "Taqman" had their highest frequencies at phase 4, and "Illumina" and "ACMG" showed an increasing trend during the whole period. In Database, the term "bioinformatics" showed the highest frequency at phase 4. In Gene, the terms "gene", "genome", "allele", "codon", "cDNA", "chromosome", "DNA", and "mtDNA" exhibited their highest frequencies at phase 2 and decreased in frequency from phase 3 to phase 4.

Statistical analysis

The linear regression over the period
To evaluate linear trends, linear regression was conducted with keyword frequencies for publication years from 1975 to 2020. Although 2020 showed a decreasing trend in the academic categories and subcategories, all the categories showed high regression values (from 0.586 (Company/Consortium) to 0.764 (Biology)) as shown in Table 3; Fig. 2. All the categories showed an increasing linear correlation between keyword frequencies and publication years.

The generalized linear model within a phase
Because the linear regression analysis without phases demonstrated a high correlation (R2 > 0.585) in all categories, we conducted linear regression within each phase for each category. For this phase-based linear analysis, we performed a GLM evaluation based on phases (Fig. 3; Table 4). No significant linear correlation was found in the academic categories (Supplementary file 1, Table S1), while significant linear correlations were observed in several subcategories (Table 4): Gene (P = .003) and Pathogen (P = .030) were significant in phase 0, and Gene (P = .004) and Proteomics (P = .044) were significant in phase 1. In phase 2, only Proteomics (P = .001) was significant; in phase 3, Proteomics (P = .045) and Software (P = .004) were significant; and in phase 4, only Genetics terminology was significantly fitted by the linear model (P = .039).

Discussion
In this study, we have investigated the trends in clinical genetics from 1975 to 2020. Through the network analysis, we obtained clusters of strongly related terminology, M0 to M7, as follows: M0) clinical use of bioinformatics and analysis technology; M1) gene analysis objects, methods, and software; M2) oncology regarding diagnosis, treatment, and tumor disease; M3) gene databases and analysis tools regarding pathogens; M4) DNA methylation-related disease and gene analysis; M5) gene-related terms including phylogenetics; M6) proteomics and its analytical terms; M7) gene analysis objects in the clinical laboratory. As the clinical application of cutting-edge technology increases, research items with high requirements for standardization are being revealed, and the scope seems to be narrowing down to gene analysis, genetic materials, living organisms (i.e., biological objects), bioinformatics, and proteomics. Interestingly, we identified diseases that are prominent in clinical practice and for which standardization is often mentioned or in high demand, such as oncological diseases (tumors and cancer) and DNA methylation diseases (acute myeloid leukemia (AML) and glioblastoma). The period analysis revealed the points at which the standardization trend in clinical genetics changed, and the content analysis revealed which keywords increased at those points.
For instance, in April 2003, the Human Genome Project, the world's largest collaborative biological project, which began in 1990, was completed [22], and its ramifications seem to have been reflected in the trend shift at phase 2. In a combined review of the content analysis and network analysis results, an increasing appearance of genetic analysis terms such as "qPCR", "microarray", "electrophoresis", and "Taqman" was observed at this point.
Another example is the approval of Illumina's sequencer by the US-FDA in 2013 [23]. An increasing trend shift was observed at phase 3 in the form of increased frequencies of sequencing-related terms ("miRNA", "rRNA"), devices ("Illumina", "MiSeq"), and analysis techniques/software ("WGS", "GWAS", "geNorm", "NormFinder"). Although Illumina launched the MiSeq in 2011 and the HiSeq 2500 sequencer in 2012, the social influence of the FDA approval seems to have affected standardization research in genomics more than the device launches did.
Taking the content analysis and statistical analysis results together, we suggest that these genetics terminologies, especially gene analysis technology including biological objects, are highly likely to increase in future trends and could be promising standards research topics in clinical genomics. In addition, considering the results of this study, when selecting standard items with ramifications for clinical genetics, we suggest taking into account FDA approval, which can increase clinical use, in order to prioritize genetic technologies.

Limitations
As the title says, this study was mainly conducted with network analysis and periodic analysis. We also performed the content analysis and statistical analysis to provide scientific support for the main analysis results. The limitations of each analysis are as follows: (1) For the network analysis results, we reviewed only ten keywords in each modularity class. For a more precise interpretation of the results, all the keywords in each cluster should be reviewed in future research. (2) A more objective basis for the relation between the period analysis and social events should be provided. (3) For the content analysis and statistical analysis, it could be more appropriate to use modularity values rather than the category characteristics of keywords. Addressing these limitations in future keyword analysis research would improve the quality of the research.

Conclusion
Despite the steep decrease in keyword frequency in 2020, caused by the downturn of genomics research during the COVID-19 pandemic, the overall research field related to standards in genomics showed a significantly positive trend from 1975 to September 2020 (R2 > 0.585, Table 3; Fig. 2).
In the GLM analysis within a phase, Genetics terminology keywords regarding methylation showed a significantly increasing trend (P = .039), together with clinical terminology of DNA methylation diseases, such as AML and GBM. Also, from the period analysis results, we identified other influential events in genetics, such as the completion of the Human Genome Project in 2003, the approval of NGS by the US-FDA in 2013, and the outbreak of the COVID-19 pandemic in 2020; these social events seem to have considerably influenced standardization research in genomics. Through this comprehensive network analysis study with period, content, and statistical analyses, we could provide various types of information, such as the relationships between terminologies, the most influential social issues in the field of genomics standards, and trend shifts in genomics terminology fields. Moreover, we statistically estimated and suggested future trends and identified high-demand items in international standardization for clinical genetics. Therefore, the genomics trend analysis results of this study can be used as guidance for directing future standards development efforts in clinical genomics.

Acknowledgements
We thank So Young Shim for the data preparation, from data export to keyword removal, and Seung Hwan Jeong for removing keywords in the data preparation procedure.

Authors' contributions
Sun-Ju Ahn initiated the study and the research project. Sun-Ju Ahn, Eun Bit Bae & Se Jin Nam designed methodology; Eun Bit Bae & Se Jin Nam constructed overall and detailed study design, and prepared keyword data; Se Jin Nam conducted network analysis and period analysis; Eun Bit Bae conducted the content analysis, period analysis, statistical analysis, interpreted

Data Availability
All data generated or analyzed during this study are included in this published article and its supplementary information files.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.